论文地址:https://arxiv.org/pdf/2109.06668.pdf 本⽂介绍深度强化学习领域第⼀篇系统性的综述⽂章Exploration in Deep Reinforcement Learning: A Comprehensive Survey。该综述⼀共调研了将近200篇⽂献,涵盖了深度强化学习和多智能体深度强化学习两⼤领域近100种探索算法。总的来说,该综述的贡献主要可以总结为以下四⽅⾯:
主要作者介绍 杨天培博⼠,现任University of Alberta博⼠后研究员。杨博⼠在2021年从天津⼤学取得博⼠学位,她的研究兴趣主要包括迁移强化学习和多智能体强化学习。杨博⼠致⼒于利⽤迁移学习、层次强化学习、对⼿建模等技术提升强化学习和多智能体强化学习的学习效率和性能。⽬前已在IJCAI、AAAI、ICLR、NeurIPS等顶级会议发表论⽂⼗余篇,担任多个会议期刊的审稿⼈。 汤宏垚博⼠,天津⼤学博⼠在读。汤博⼠的研究兴趣主要包括强化学习、表征学习,其学术成果发表在AAAI、IJCAI、NeurIPS、ICML等顶级会议期刊上。 ⽩⾠甲博⼠,哈尔滨⼯业⼤学博⼠在读,研究兴趣包括探索与利⽤、离线强化学习,学术成果发表在ICML、NeurIPS等。 刘⾦毅,天津⼤学智能与计算学部硕⼠在读,研究兴趣主要包括强化学习、离线强化学习等。 郝建业博⼠,天津⼤学智能与计算学部副教授。主要研究⽅向为深度强化学习、多智能体系统。发表⼈⼯智能领域国际会议和期刊论⽂100余篇,专著2部。主持参与国家基⾦委、科技部、天津市⼈⼯智能重⼤等科研项⽬10余项,研究成果荣获ASE2019、DAI2019、CoRL2020最佳论⽂奖等,同时在游戏AI、⼴告及推荐、⾃动驾驶、⽹络优化等领域落地应⽤。 Reference[1]P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machinelearning, vol. 47, no. 2-3, pp. 235–256, 2002.[2]I. Osband, B. V. Roy, and Z. Wen, “Generalization and exploration via randomized value functions,” inInternational Conference on Machine Learning, 2016, pp. 2377–2386.[3]I. Osband, C. Blundell, A. Pritzel, and B. V. Roy, “Deep exploration via bootstrapped DQN,” in Advances inNeural Information Processing Systems 29, 2016, pp. 4026–4034.[4]K. Ciosek, Q. Vuong, R. Loftin, and K. Hofmann, “Better exploration with optimistic actor critic,” inAdvances in Neural Information Processing Systems, 2019, pp. 1785–1796.[5]C. Bai, L. Wang, L. Han, J. Hao, A. Garg, P. Liu, and Z. Wang, “Principled exploration via optimisticbootstrapping and backward induction,” in International Conference on Machine Learning, 2021.[6]J. Kirschner and A. Krause, “Information directed sampling and bandits with heteroscedastic noise,” inConference On Learning Theory, 2018, pp. 358–384.[7]B. Mavrin, H. Yao, L. Kong, K. Wu, and Y. Yu, “Distributional reinforcement learning for efficientexploration,” in International Conference on Machine Learning, 2019, pp. 4424–4434.[8]D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervisedprediction,” in International Conference on Machine Learning, 2017, pp. 2778–2787.[9]H. Kim, J. Kim, Y. Jeong, S. Levine, and H. O. Song, “EMI: exploration with mutual information,” inInternational Conference on Machine Learning, 2019, pp. 3360–3369.[10]Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov, “Exploration by random network distillation,” inInternational Conference on Learning Representations, 2019.[11]R. Y. Tao, V. François-Lavet, and J. Pineau, “Novelty search in representational space for sample efficientexploration,” in Advances in Neural Information Processing Systems, 2020.[12]Y. Du, L. Han, M. Fang, J. Liu, T. Dai, and D. Tao, “LIIR: learning individual intrinsic reward in multi-agentreinforcement learning,” in Advances in Neural Information Processing Systems, 2019, pp. 4405– 4416 [13]R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel, “VIME: variational information maximizing exploration,” in Advances in Neural Information Processing Systems, 2016, pp. 1109–1117. [14]T. Wang, J. Wang, Y. Wu, and C. Zhang, “Influence-based multi-agent exploration,” in International Conference on Learning Representations, 2020[15]D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver, “Distributed prioritized experience replay,” in International Conference on Learning Representations, 2018. [16]S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney, “Recurrent experience replay in distributed reinforcement learning,” in International Conference on Learning Representations, 2019. [17]M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hessel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg, “Noisy networks for exploration,” in International Conference on Learning Representations, 2018.[18]E. Adrien, H. Joost, L. Joel, S. K. O, and C. Jeff, “First return, then explore,” Nature, vol. 590, no. 7847, pp.580–586, 2021.[19]A. Mahajan, T. Rashid, M. Samvelyan, and S. Whiteson, “MAVEN: multi-agent variational exploration,” inAdvances in Neural Information Processing Systems, 2019, pp. 7611–7622.